HMA heterogeneity paper
Evaluation of normalisation and batch correction methods (integrating three batches)
This report performs and assesses QC filtering, normalisation and batch correction of scRNAseq data (from scNMT dataset) across 3 batches of treated and untreated single cells.
library(knitr)
THIS STEP IS ALREADY DONE
This is based directly on the annotation (GTF) file used The output of TEtranscripts is ensembl ID and counts. Therefore, this table provides a way to update the output with the information already contained within the gene GTF file (annotation file) used in the TEtranscripts run. This code was based on https://www.biostars.org/p/140471/
N.B. These should be the same gene GTF files used in the TEtranscripts step!
library(rtracklayer)
library(dplyr)
library(readr)
# Version 1 (gencode)
GTF_file <- "./scripts/annos/GTF_files/genes/gencode.v30.annotation_MT_noChr_corrected.gtf" #v2 was using gencode.v30.annotation_NoChr.gtf (only minor update - 'MT' name)
newGTF <- import(GTF_file)
newGTF2 <- as.data.frame(newGTF)
# Extract only genes and export standardised table (used for downstream analysis)
newGTF2 <- newGTF2 %>%
filter(type == "gene") %>%
dplyr::select(Geneid = gene_id, GeneSymbol = gene_name,
Chromosome = seqnames, Start = start, End = end,
Class = gene_type, Strand = strand, Length = width) %>%
mutate(Chromosome = gsub("chr", "", Chromosome)) ## Removes 'chr' in case GTF file was from ensembl (and not already removed)
write_tsv(newGTF2, file = "./scripts/annos/gencode.v30_gene_annotation_table3.txt")
THIS STEP IS ALREADY DONE
See:
* RNAseq count table(s):
/research/canepi/Sequencing_Data_Raw/2_Processed/200804_KU_NextSeq_scNMT-scRNAseq/TEtranscripts/060320/Passing_QC/TE_and_gene_counts_Updated_and_Raw_QCd.txt
* RNAseq count table(s):
/research/canepi/Sequencing_Data_Raw/2_Processed/200804_KU_NextSeq_scNMT-scRNAseq/TEtranscripts/250620/Passing_QC/TE_and_gene_counts_Updated_and_Raw_QCd.txt
* RNAseq count table(s):
/research/canepi/Sequencing_Data_Raw/2_Processed/220606_HL_NOVAseq_scRNAseq_HL/TEtranscripts/Passing_QC/TE_and_gene_counts_Updated_and_Raw_QCd.txt
This has been updated to automatically detect and merge files if given a path/filename that matches more than one file.
source("./scripts/Create_sce_from_TEtranscritps_raw_counts_scNMT.R")
create_sce_scNMT(input.counts = "./1_data/scNMT/RNAseq/TE_and_gene_counts_Updated_and_Raw", * Will automatically detect and try to merge all files with a matching common basename.
sample.sheet = "./1_data/scNMT/QC_Summary_combined_NMT_all_batches.csv",
sample.sheet.name.column = "Sample",
output.directory = "./2_results/scNMT/RNAseq")
This is to run the QC and filtering (of features) script:
# This includes initial QC and secondary filtered QC
source("./scripts/Perform_sce_QC_and_filter-seqmonk_based.R")
perform_QC_scRNAseq(input.file = "./2_results/scNMT/RNAseq/sce.rds",
output.file.dir = './2_results/scNMT/RNAseq',
output.fig = 'figures/scNMT/RNAseq/QC',
min.counts = 5,
min.cells = NA,
min.cells.percentage = 10, # If a number is entered here, 'min.cells' is ignored.
batch.order.col = c("b320" = "#D55E00", "b620" = "#009E73", "b322" = "#0072B2"))
The inititial QC assessment is utilising the scran::scater() package.
There is clear difference in library size across the three batches, particularly comparing the third batch (b322) with the first two. | This is further highlighted in the second plot, with the first batch (b320) having both a lower library size and an on average lower number of features detected.
include_graphics('./figures/scNMT/RNAseq/QC/scNMT_QC1-libsizes.png')
include_graphics('./figures/scNMT/RNAseq/QC/scNMT_QC1-scatter.png')
include_graphics('./figures/scNMT/RNAseq/QC/scNMT_QC1-combined.png')
include_graphics('./figures/scNMT/RNAseq/QC/scNMT_QC1-PCA.png')
Samples are filtered based on the results from the seqmonk QC (see scNMT_QC markdown) that produced a list of samples passing QC. Then genes/TEs are filter for above thresholds to remove missing/lowly expressed genes, by requiring x amount of reads in at least Y% cells/samples.
include_graphics('./figures/scNMT/RNAseq/QC/scNMT_QC2-combined.png')
include_graphics('./figures/scNMT/RNAseq//QC/scNMT_QC2-PCA.png')
include_graphics('./figures/scNMT/RNAseq//QC/scNMT_QC2-PCA-Genes_TEs_separate.png')
scater can also compute the marginal R2 for each variable when fitting a linear model regressing expression values for each gene against just that variable, and display a density plot of the gene-wise marginal R2 values for the variables. | Each line corresponds to one factor and represents the distribution of R2 values across all genes. The factors being regressed/ plotted are ranked and ordered (as are their colours), so the ‘top’ factor (in blue) in the legend is the one that explains ‘the most’ amount of variance in the dataset.
include_graphics('./figures/scNMT/RNAseq/QC/scNMT_QC1.2-explanatoryvariables.png')
Here, we can see ‘batch’ accounts for more variance in expression than treatment in both the initial QC and post filtering. After filtering, number of genes and TEs detected and batch still explain the highest amounts of variance. The subset of mitochondria explains the least amount of variance across this data. Therefore, batch correction needs to applied.
https://satijalab.org/seurat/articles/integration_introduction.html
‘Seurat v4 includes a set of methods to match (or ‘align’) shared
cell populations across datasets. These methods first identify
cross-dataset pairs of cells that are in a matched biological state
(‘anchors’), can be used both to correct for technical differences
between datasets (i.e. batch effect correction), and to perform
comparative scRNA-seq analysis of across experimental conditions.’
Seurat uses it’s own normalisation approach + batch correction by:
- Normalise batches separately (vst or SCTransform) and identify ‘x
HVGs’. - Identify the common HVGs across the three batches to uses as
‘anchors’ - Uses the identified ‘anchors’ with canonical correlation
analysis (CCA) + MNN to batch correct (need further reading)
Optimal method found and used was Seurat SCTransform + integration (normalisation + batch correction)
source("./scripts/converted_from_HPC/batch_correction_Seurat.R")
batch_correction_seurat4(input.file = "2_results/scNMT/RNAseq/sce_qc2.rds",
output.file = "2_results/scNMT/RNAseq/Batch_Correction/Seurat/Seurat4.rds",
output.fig.dir = 'figures/scNMT/RNAseq/Batch_Correction/Seurat/',
assay.name = "Seurat", # This is the assay name in the output RDS.
batch.column = 'Batch',
seurat.norm = "SCTransform", # Can be either 'SCTransform' or 'vst'. seurat.SCTransform.vst.flavor = "v2", # NOT CODED YET. Only used if 'seurat.norm = SCTransform'.
nhvgs = 5000, # Number of highly variable genes used for anchoring detection and integration (retained in output batch corrected rds file)
k.weight = 50, # seurat default is 100. Adjusted for lower throughput scNMT data (96 well plates + QC requirements).
set.the.seed = TRUE,
force = FALSE)
Update integrated assay with NAs for features/cells that previously contained ‘missing’ data, and were imputed.
source("./scripts/batch_correction_Seurat.R")
seurat_update_with_NAs(input.seurat.rds = "2_results/scNMT/RNAseq/Batch_Correction/Seurat/Seurat4.rds",
output.file = "2_results/scNMT/RNAseq/Batch_Correction/Seurat/Seurat4_NAs.rds",
assay.with.missing.values = "RNA",
assay.to.add.missing.values = "integrated",
new.assay.name = "integrated_NAs")
nvhgs <- 5000
norm <- 'SCTransform'
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/sce_qc2_Seurat4_SCTransform_5000_batch_vs_tx.png')
nvhgs <- 5000
norm <- 'SCTransform'
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/sce_qc2_Seurat4_SCTransform_5000_PCA.png')
nvhgs <- 5000
norm <- 'SCTransform'
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/sce_qc2_Seurat4_SCTransform_5000_UMAP.png')
nvhgs <- 5000
norm <- 'SCTransform'
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/sce_qc2_Seurat4_SCTransform_5000_TEs.png')
nvhgs <- 5000
norm <- 'SCTransform'
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/sce_qc2_Seurat4_SCTransform_5000_TE_Expr_overlap.png')
nvhgs <- 5000
norm <- 'SCTransform'
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/sce_qc2_Seurat4_SCTransform_5000_TEs_Violin-Ridge_v2.png')
source("./scripts/converted_from_HPC/convert_seurat_to_sce.R")
convert_seurat_to_sce(sce.original = "2_results/scNMT/RNAseq/sce_qc2.rds",
seurat.corrected = "2_results/scNMT/RNAseq/Batch_Correction/Seurat/Seurat4_NAs.rds",
output.sce = "2_results/scNMT/RNAseq/Batch_Correction/Seurat/Seurat4_NAs_sce.rds",
compare.raw.vs.corrected = TRUE,
output.fig.dir = "./figures/scNMT/RNAseq/Batch_Correction/Seurat/")
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/ExplanatoryVariables-integrated.png')
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/ExplanatoryVariables-raw_vs_integrated.png')
Compare feature counts for batch corrected vs normalised and ‘raw’
source("./scripts/Plot_RNAseq_correction_comparison.R")
plot_corrected_counts_comparison(input.seurat.rds = "2_results/scNMT/RNAseq/Batch_Correction/Seurat/Seurat4.rds",
output.fig = "figures/scNMT/RNAseq/Batch_Correction/Seurat/Seurat_correction_counts_comparison.png",
raw.assay = "RNA",
norm.assay = "SCT",
corrected.assay = "integrated",
batch.order = c("b320","b620","b322"),
batch.names = c("060320" = "b320", "250620" = "b620", "220331" = "b322"))
Each point shows the mean counts for a feature across cells by treatment in a given batch. This compares counts for raw –> normalised –> batch corrected.
include_graphics('./figures/scNMT/RNAseq/Batch_Correction/Seurat/Seurat_correction_counts_comparison.png')